The tasks included in this report are:
unique(measure_labels$task_group)
[1] "adaptive_n_back" "attention_network_task"
[3] "choice_reaction_time" "directed_forgetting"
[5] "dot_pattern_expectancy" "local_global_letter"
[7] "motor_selective_stop_signal" "recent_probes"
[9] "shape_matching" "simon"
[11] "stim_selective_stop_signal" "stop_signal"
[13] "stroop" "threebytwo"
adaptive_n_back
attention_network_task
choice_reaction_time
directed_forgetting
dot_pattern_expectancy
local_global_letter
motor_selective_stop_signal
recent_probes
shape_matching
simon
stim_selective_stop_signal
stop_signal
stroop
threebytwo
Before going on with the rest of the report first decide on whether there are significant differences in the distributions of the HDDM parameter estimates and their reliabilities depending on whether they are fit on the full sample (n=552) or retest sample (n=150) for t1 data.
Differences in distributions: using scaled differences
Does the distribution of scaled difference scores (between using n=150 or n=552) have a mean different than 0 allowing for random effects of parameter accounting for the different types of parameters? No.
if(!exists('hddm_pars')){
source('/Users/zeynepenkavi/Dropbox/PoldrackLab/SRO_DDM_Analyses/code/workspace_scripts/hddm_pars_data.R')
}
summary(lmer(scaled_diff ~ rt_acc + (1|dv), hddm_pars))
Linear mixed model fit by REML ['lmerMod']
Formula: scaled_diff ~ rt_acc + (1 | dv)
Data: hddm_pars
REML criterion at convergence: 41695
Scaled residuals:
Min 1Q Median 3Q Max
-10.438 -0.542 -0.004 0.518 9.750
Random effects:
Groups Name Variance Std.Dev.
dv (Intercept) 3.24e-33 5.69e-17
Residual 9.93e-01 9.97e-01
Number of obs: 14722, groups: dv, 103
Fixed effects:
Estimate Std. Error t value
(Intercept) 6.82e-18 9.82e-03 0
rt_accnon-decision -2.70e-17 2.44e-02 0
rt_accthreshold 9.98e-18 2.25e-02 0
Correlation of Fixed Effects:
(Intr) rt_cc-
rt_ccnn-dcs -0.403
rt_ccthrshl -0.436 0.176
# Same result
# summary(MCMCglmm(scaled_diff ~ rt_acc, random = ~ dv, data=hddm_pars))
Are there differences in reliability depending which sample the HDDM parameters are estimated from? No.
Do hddm parameters differ in their reliability depending on whether they are derived from n=150 or n=552? No.
if(!exists('hddm_rels')){
source('/Users/zeynepenkavi/Dropbox/PoldrackLab/SRO_DDM_Analyses/code/workspace_scripts/hddm_rel_data.R')}
hddm_rels = hddm_rels %>%
gather(sample, value, -dv, -rt_acc)
summary(lmer(value ~ sample*rt_acc + (1|dv), hddm_rels))
Linear mixed model fit by REML ['lmerMod']
Formula: value ~ sample * rt_acc + (1 | dv)
Data: hddm_rels
REML criterion at convergence: -1207
Scaled residuals:
Min 1Q Median 3Q Max
-3.492 -0.260 0.013 0.254 3.510
Random effects:
Groups Name Variance Std.Dev.
dv (Intercept) 0.029746 0.172
Residual 0.000196 0.014
Number of obs: 440, groups: dv, 220
Fixed effects:
Estimate Std. Error t value
(Intercept) 0.50848 0.01642 30.96
samplerefit -0.00331 0.00188 -1.76
rt_accnon-decision 0.03807 0.02889 1.32
rt_accthreshold -0.02978 0.02836 -1.05
samplerefit:rt_accnon-decision -0.00394 0.00331 -1.19
samplerefit:rt_accthreshold 0.00497 0.00325 1.53
Correlation of Fixed Effects:
(Intr) smplrf rt_cc- rt_cct smp:_-
samplerefit -0.057
rt_ccnn-dcs -0.568 0.033
rt_ccthrshl -0.579 0.033 0.329
smplrft:r_- 0.033 -0.568 -0.057 -0.019
smplrft:rt_ 0.033 -0.579 -0.019 -0.057 0.329
Do HDDM parameter estimates fit participant data better if the model prior is informed by a larger sample? To assess model fit we sampled from the posterior predictive for each subject and regressed the predicted RT’s on actual RT’s (using both raw RT’s as well as log-transformed RT’s). Our measure of fit is the (adjusted) \(R^2\) from these regressions for each participant.
Here we compare model fits between the same model using a prior using n=552 vs n=150.
Note: In calculating the fit statistics for the full sample (n=552) hierarchical models we used fewer samples from the posterior predictive due to computational constraints. We did, however, confirm that the number of samples from the posterior predictive did not have an effect on the fit statistics. Details of those analyses can be found here.
refit_fitstats %>%
left_join(t1_hierarchical_fitstats, by = c("subj_id", "task_name")) %>%
ggplot(aes(log_rsq_adj.x, log_rsq_adj.y))+
geom_point(aes(color=task_name))+
xlab("Adjusted r-sq for n=150")+
ylab("Adjusted r-sq for n=552")+
theme(legend.position = "none")+
geom_abline(slope = 1, intercept=0)
Warning: Column `subj_id` joining factors with different levels, coercing
to character vector
Warning: Removed 106 rows containing missing values (geom_point).
The fit statistics (looking here at the adjusted \(R^2\)s using log-transformed RT’s) for the same models using n=552 and n=150 for the priors are highly correlated.
tmp = refit_fitstats %>%
left_join(t1_hierarchical_fitstats, by = c("subj_id", "task_name"))
Warning: Column `subj_id` joining factors with different levels, coercing
to character vector
with(tmp, cor.test(log_rsq_adj.x, log_rsq_adj.y))
Pearson's product-moment correlation
data: log_rsq_adj.x and log_rsq_adj.y
t = 120, df = 1900, p-value <2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.9317 0.9425
sample estimates:
cor
0.9373
And they do not systematically differ from each other.
with(tmp, t.test(log_rsq_adj.x, log_rsq_adj.y, paired=T))
Paired t-test
data: log_rsq_adj.x and log_rsq_adj.y
t = -1.1, df = 1900, p-value = 0.3
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.0006482 0.0001833
sample estimates:
mean of the differences
-0.0002325
Conclusion: Parameter estimates from refits to retest sample only for t1 data do not differ in any detectable way from estimates using the full t1 sample. They are also in principle more comparable to the parameter estimates from t2 and therefore used in the rest of this report.
Plot reliability point estimates comparing DDM measures to raw measures faceting for contrast measures.
fig_name = 'ddmvsraw_point.jpeg'
knitr::include_graphics(paste0(fig_path, fig_name))
Plot averaged bootstrapped reliability estimates per measure comparing DDM measures to raw measures faceting for contrast measures.
fig_name = 'ddmvsraw_boot.jpeg'
knitr::include_graphics(paste0(fig_path, fig_name))
Model testing if the reliability of raw measures differs from that of ddm estimates and if contrast measures differ from non-contrast measures.
Checking if both fixed effects of raw vs ddm and contrast vs non-contrast as well as their interaction is necessary.
Conclusion: Interactive model is best.
mer1 = lmer(icc ~ ddm_raw + (1|dv), boot_df %>% filter(rt_acc != "other"))
mer1a = lmer(icc ~ overall_difference + (1|dv), boot_df %>% filter(rt_acc != "other"))
mer2 = lmer(icc ~ ddm_raw + overall_difference + (1|dv), boot_df %>% filter(rt_acc != "other"))
mer3 = lmer(icc ~ ddm_raw * overall_difference + (1|dv), boot_df %>% filter(rt_acc != "other"))
anova(mer1, mer2, mer3)
refitting model(s) with ML (instead of REML)
anova(mer1a, mer2, mer3)
refitting model(s) with ML (instead of REML)
rm(mer1, mer2, mer1a)
Raw measures do not significantly differ from ddm parameters in their reliability but non-contrast measures are significantly more reliable compared to contrast and condition measures.
summary(mer3)
Linear mixed model fit by REML ['lmerMod']
Formula: icc ~ ddm_raw * overall_difference + (1 | dv)
Data: boot_df %>% filter(rt_acc != "other")
REML criterion at convergence: -654256
Scaled residuals:
Min 1Q Median 3Q Max
-130.34 -0.36 0.03 0.40 17.35
Random effects:
Groups Name Variance Std.Dev.
dv (Intercept) 0.0518 0.228
Residual 0.0162 0.127
Number of obs: 512000, groups: dv, 512
Fixed effects:
Estimate Std. Error t value
(Intercept) 0.5947 0.0245 24.23
ddm_rawraw -0.0329 0.0395 -0.83
overall_differencecontrast -0.3660 0.0366 -9.99
overall_differencecondition -0.0865 0.0304 -2.84
ddm_rawraw:overall_differencecontrast 0.0790 0.0597 1.32
ddm_rawraw:overall_differencecondition -0.0743 0.0490 -1.52
Correlation of Fixed Effects:
(Intr) ddm_rw ovrll_dffrnccnt ovrll_dffrnccnd
ddm_rawraw -0.621
ovrll_dffrnccnt -0.670 0.416
ovrll_dffrnccnd -0.806 0.501 0.540
ddm_rwrw:vrll_dffrnccnt 0.411 -0.662 -0.614 -0.331
ddm_rwrw:vrll_dffrnccnd 0.501 -0.807 -0.336 -0.621
ddm_rwrw:vrll_dffrnccnt
ddm_rawraw
ovrll_dffrnccnt
ovrll_dffrnccnd
ddm_rwrw:vrll_dffrnccnt
ddm_rwrw:vrll_dffrnccnd 0.534
What is the best measure of individual difference for any measure that has both raw and DDM parameters?
Even though overall the ddm parameters are not significantly less reliable the most reliable measure is more frequently a raw measure. There are some examples of an EZ estimate being the best for a task as well. Regardless of raw vs ddm the best measure is always a non-contrast measure.
rel_df %>%
group_by(task_group) %>%
filter(icc == max(icc)) %>%
select(task_group, everything())
fig_name = 'ddmvsraw_varsubs.jpeg'
knitr::include_graphics(paste0(fig_path, fig_name))
fig_name = 'ddmvsraw_varresid.jpeg'
knitr::include_graphics(paste0(fig_path, fig_name))
Model testing if the percentage of between subjects variance of raw measures differs from that of ddm estimates and if contrast measures differ from non-contrast measures.
Checking if both fixed effects of raw vs ddm and contrast vs non-contrast as well as their interaction is necessary.
Conclusion: Model with fixed effects for both is best.
mer1 = lmer(var_subs_pct ~ ddm_raw + (1|dv), boot_df %>% filter(rt_acc != "other"))
mer1a = lmer(var_subs_pct ~ overall_difference + (1|dv), boot_df %>% filter(rt_acc != "other"))
mer2 = lmer(var_subs_pct ~ ddm_raw + overall_difference + (1|dv), boot_df %>% filter(rt_acc != "other"))
mer3 = lmer(var_subs_pct ~ ddm_raw * overall_difference + (1|dv), boot_df %>% filter(rt_acc != "other"))
anova(mer1, mer2, mer3)
refitting model(s) with ML (instead of REML)
anova(mer1a, mer2, mer3)
refitting model(s) with ML (instead of REML)
rm(mer1, mer1a, mer3)
Contrast measures have lower between subjects variability (ie are worse individual difference measures). Raw and ddm measures do not differ significantly.
summary(mer2)
Linear mixed model fit by REML ['lmerMod']
Formula: var_subs_pct ~ ddm_raw + overall_difference + (1 | dv)
Data: boot_df %>% filter(rt_acc != "other")
REML criterion at convergence: 3845550
Scaled residuals:
Min 1Q Median 3Q Max
-4.485 -0.657 0.065 0.669 4.841
Random effects:
Groups Name Variance Std.Dev.
dv (Intercept) 197 14.0
Residual 106 10.3
Number of obs: 512000, groups: dv, 512
Fixed effects:
Estimate Std. Error t value
(Intercept) 50.59 1.29 39.4
ddm_rawraw 2.30 1.28 1.8
overall_differencecontrast -8.02 1.78 -4.5
overall_differencecondition -2.57 1.47 -1.7
Correlation of Fixed Effects:
(Intr) ddm_rw ovrll_dffrnccnt
ddm_rawraw -0.383
ovrll_dffrnccnt -0.619 0.012
ovrll_dffrnccnd -0.745 -0.001 0.536
Model testing if the percentage of residual variance of raw measures differs from that of ddm estimates and if contrast measures differ from non-contrast measures.
Checking if both fixed effects of raw vs ddm and contrast vs non-contrast as well as their interaction is necessary.
Conclusion: Interactive model is best
mer1 = lmer(var_resid_pct ~ ddm_raw + (1|dv), boot_df %>% filter(rt_acc != "other"))
mer1a = lmer(var_resid_pct ~ overall_difference + (1|dv), boot_df %>% filter(rt_acc != "other"))
mer2 = lmer(var_resid_pct ~ ddm_raw + overall_difference + (1|dv), boot_df %>% filter(rt_acc != "other"))
mer3 = lmer(var_resid_pct ~ ddm_raw * overall_difference + (1|dv), boot_df %>% filter(rt_acc != "other"))
anova(mer1, mer2, mer3)
refitting model(s) with ML (instead of REML)
anova(mer1a, mer2, mer3)
refitting model(s) with ML (instead of REML)
rm(mer1, mer1a, mer2)
Both contrast and condition measures have higher residual variance. Raw and ddm measures do not differ.
summary(mer3)
Linear mixed model fit by REML ['lmerMod']
Formula: var_resid_pct ~ ddm_raw * overall_difference + (1 | dv)
Data: boot_df %>% filter(rt_acc != "other")
REML criterion at convergence: 3308034
Scaled residuals:
Min 1Q Median 3Q Max
-6.179 -0.562 0.016 0.604 7.682
Random effects:
Groups Name Variance Std.Dev.
dv (Intercept) 64.7 8.05
Residual 37.2 6.10
Number of obs: 512000, groups: dv, 512
Fixed effects:
Estimate Std. Error t value
(Intercept) 19.460 0.868 22.42
ddm_rawraw 1.279 1.397 0.92
overall_differencecontrast 12.009 1.295 9.27
overall_differencecondition 2.783 1.076 2.59
ddm_rawraw:overall_differencecontrast -1.387 2.111 -0.66
ddm_rawraw:overall_differencecondition 2.968 1.732 1.71
Correlation of Fixed Effects:
(Intr) ddm_rw ovrll_dffrnccnt ovrll_dffrnccnd
ddm_rawraw -0.621
ovrll_dffrnccnt -0.670 0.416
ovrll_dffrnccnd -0.806 0.501 0.540
ddm_rwrw:vrll_dffrnccnt 0.411 -0.662 -0.614 -0.331
ddm_rwrw:vrll_dffrnccnd 0.501 -0.807 -0.336 -0.621
ddm_rwrw:vrll_dffrnccnt
ddm_rawraw
ovrll_dffrnccnt
ovrll_dffrnccnd
ddm_rwrw:vrll_dffrnccnt
ddm_rwrw:vrll_dffrnccnd 0.534
rm(mer3)
Differences in DDM parameter reliability for t1 data using either n=552 or n=150 were reported above under T1 HDDM parameters. No meaningful differences exist between these two sample sizes.
But even 150 is a large sample size for psychological studies, especially forced choice reaction time tasks that are included in this report. Here we look at how the reliability for raw and ddm measures change for sample sizes that are more common in studies using these tasks (25, 50, 75, 100, 125, 150)
Note: Not refitting HDDM’s for each of these sample sizes since a. there were no differences in parameter stability for n=150 vs 552 and b. a more comprehensive comparison using non-hierarchical estimates and model fit indices will follow. [Should I revisit this? - 150 and 552 might be too large to lead to changes in parameter estimates but smaller samples that are more common in psych studies might sway estimates more]
Note: Some variables do not have enough variance to calculate reliability for difference sample sizes. These variables are:
>stroop.post_error_slowing
>simon.std_rt_error
>shape_matching.post_error_slowing
>directed_forgetting.post_error_slowing
>choice_reaction_time.post_error_slowing
>choice_reaction_time.std_rt_error
>dot_pattern_expectancy.post_error_slowing
>motor_selective_stop_signal.go_rt_std_error
>motor_selective_stop_signal.go_rt_error
>attention_network_task.post_error_slowing
>recent_probes.post_error_slowing
>simon.post_error_slowing
>dot_pattern_expectancy.BY_errors
source('/Users/zeynepenkavi/Dropbox/PoldrackLab/SRO_DDM_Analyses/code/workspace_scripts/ddm_reldf_sample_size.R')
Warning: Column `dv` joining factor and character vector, coercing into
character vector
Does the mean reliability change with sample size?
Yes. The larger the sample size the more reliable is a given measure on average. The largest increase in reliability is when shifting from 25 to 50 subjects. This is important because many studies using these measures have sample sizes <50 per group.
fig_name = 'rel_by_samplesize.jpeg'
knitr::include_graphics(paste0(fig_path, fig_name))
When <15 subjects are used to calculate the measures they are significantly less reliable.
summary(lmer(icc ~ sample_size + (1|dv) + (1|iteration), rel_df_sample_size))
Linear mixed model fit by REML ['lmerMod']
Formula: icc ~ sample_size + (1 | dv) + (1 | iteration)
Data: rel_df_sample_size
REML criterion at convergence: 2823504
Scaled residuals:
Min 1Q Median 3Q Max
-605.6 0.0 0.0 0.0 0.4
Random effects:
Groups Name Variance Std.Dev.
dv (Intercept) 0.06526 0.2555
iteration (Intercept) 0.00427 0.0653
Residual 65.64977 8.1025
Number of obs: 402034, groups: dv, 505; iteration, 100
Fixed effects:
Estimate Std. Error t value
(Intercept) 0.311251 0.024771 12.57
sample_size 0.001557 0.000316 4.92
Correlation of Fixed Effects:
(Intr)
sample_size -0.673
Are there differences between any other sample sizes? This ignores the differences between variables but there seems to be only differences between n=10 and all other larger sample size.
with(rel_df_sample_size_summary, pairwise.t.test(mean_icc, sample_size, p.adjust.method = "bonferroni"))
Pairwise comparisons using t tests with pooled SD
data: mean_icc and sample_size
10 15 20 25 50 75 100
15 2e-04 - - - - - -
20 9e-06 1 - - - - -
25 2e-06 1 1 - - - -
50 9e-08 1 1 1 - - -
75 4e-08 1 1 1 1 - -
100 3e-08 1 1 1 1 1 -
125 2e-08 1 1 1 1 1 1
P value adjustment method: bonferroni
Does the change in reliabiliity with sample size vary by variable type?
The changes do not differ by raw vs. ddm measures. It does differ by whether the measure is a contrast measure: For contrast measures all increases in reliability with sample size are larger than the increases for non-contrast measures. This implies that larger sample sizes are even more crucial for studies using contrast measures as trait-level measures.
tmp = rel_df_sample_size %>%
group_by(sample_size, ddm_raw, overall_difference) %>%
summarise(mean_icc = mean(icc, na.rm=T),
sem_icc = sem(icc),
mean_var_subs_pct = mean(var_subs_pct, na.rm=T),
sem_var_subs_pct = sem(var_subs_pct),
mean_var_ind_pct = mean(var_ind_pct, na.rm=T),
sem_var_ind_pct = sem(var_ind_pct),
mean_var_resid_pct = mean(var_resid_pct, na.rm=T),
sem_var_resid_pct = sem(var_resid_pct)) %>%
na.exclude()
rel_df_sample_size_summary %>%
filter(!is.na(overall_difference)) %>%
ggplot(aes(factor(sample_size), mean_icc))+
geom_line(aes(group = dv, color=ddm_raw), alpha = 0.1)+
geom_line(data = tmp, aes(factor(sample_size),mean_icc, color=ddm_raw, group=ddm_raw))+
geom_point(data = tmp, aes(factor(sample_size),mean_icc, color=ddm_raw))+
geom_errorbar(data = tmp,aes(ymin=mean_icc-sem_icc, ymax = mean_icc+sem_icc, color=ddm_raw), width = 0.1)+
facet_wrap(~overall_difference)+
ylab("Mean reliability of 100 samples of size n")+
xlab("Sample size")+
theme(legend.title = element_blank(),
legend.position = "bottom")+
ylim(-1,1)
Warning: Removed 122 rows containing missing values (geom_path).
summary(lmer(icc ~ sample_size * overall_difference + (1|dv) + (1|iteration), rel_df_sample_size))
Linear mixed model fit by REML ['lmerMod']
Formula:
icc ~ sample_size * overall_difference + (1 | dv) + (1 | iteration)
Data: rel_df_sample_size
REML criterion at convergence: 2805380
Scaled residuals:
Min 1Q Median 3Q Max
-603.4 0.0 0.0 0.0 0.3
Random effects:
Groups Name Variance Std.Dev.
dv (Intercept) 0.04637 0.2153
iteration (Intercept) 0.00433 0.0658
Residual 66.14275 8.1328
Number of obs: 399034, groups: dv, 499; iteration, 100
Fixed effects:
Estimate Std. Error t value
(Intercept) 0.518096 0.044356 11.68
sample_size 0.000621 0.000602 1.03
overall_differencecontrast -0.459839 0.065919 -6.98
overall_differencecondition -0.211001 0.054844 -3.85
sample_size:overall_differencecontrast 0.001230 0.000905 1.36
sample_size:overall_differencecondition 0.001343 0.000753 1.78
Correlation of Fixed Effects:
(Intr) smpl_s ovrll_dffrnccnt ovrll_dffrnccnd
sample_size -0.713
ovrll_dffrnccnt -0.658 0.480
ovrll_dffrnccnd -0.791 0.577 0.532
smpl_sz:vrll_dffrnccnt 0.475 -0.665 -0.721 -0.384
smpl_sz:vrll_dffrnccnd 0.571 -0.800 -0.384 -0.721
smpl_sz:vrll_dffrnccnt
sample_size
ovrll_dffrnccnt
ovrll_dffrnccnd
smpl_sz:vrll_dffrnccnt
smpl_sz:vrll_dffrnccnd 0.532
Does variability of reliability change with sample size?
No.
rel_df_sample_size_summary %>%
na.exclude() %>%
ggplot(aes(sample_size, sem_icc))+
geom_line(aes(group = dv, color=ddm_raw), alpha = 0.1)+
geom_line(data = tmp, aes(sample_size,sem_icc, color=ddm_raw))+
geom_point(data = tmp, aes(sample_size,sem_icc, color=ddm_raw))+
facet_wrap(~overall_difference)+
ylab("Standard error of mean of reliability \n of 100 samples of size n")+
xlab("Sample size")+
theme(legend.title = element_blank(),
legend.position = "bottom")
summary(lmer(sem_icc ~ sample_size * overall_difference + (1|dv), rel_df_sample_size_summary))
Linear mixed model fit by REML ['lmerMod']
Formula: sem_icc ~ sample_size * overall_difference + (1 | dv)
Data: rel_df_sample_size_summary
REML criterion at convergence: 9708
Scaled residuals:
Min 1Q Median 3Q Max
-0.12 -0.06 -0.02 0.01 60.38
Random effects:
Groups Name Variance Std.Dev.
dv (Intercept) 0.000213 0.0146
Residual 0.657690 0.8110
Number of obs: 3992, groups: dv, 499
Fixed effects:
Estimate Std. Error t value
(Intercept) 0.038630 0.039761 0.97
sample_size -0.000334 0.000600 -0.56
overall_differencecontrast 0.048118 0.059791 0.80
overall_differencecondition 0.090710 0.049734 1.82
sample_size:overall_differencecontrast -0.000466 0.000902 -0.52
sample_size:overall_differencecondition -0.000986 0.000750 -1.31
Correlation of Fixed Effects:
(Intr) smpl_s ovrll_dffrnccnt ovrll_dffrnccnd
sample_size -0.792
ovrll_dffrnccnt -0.665 0.527
ovrll_dffrnccnd -0.799 0.633 0.532
smpl_sz:vrll_dffrnccnt 0.527 -0.665 -0.792 -0.421
smpl_sz:vrll_dffrnccnd 0.633 -0.799 -0.421 -0.792
smpl_sz:vrll_dffrnccnt
sample_size
ovrll_dffrnccnt
ovrll_dffrnccnd
smpl_sz:vrll_dffrnccnt
smpl_sz:vrll_dffrnccnd 0.532
Does between subjects variance change with sample size?
Yes. Between subjects variance decreases with sample size. This is more pronounced for non-contrast measures.
This goes against my intuitions. Looking at the change in between subjects percentage of individual measures’ there seems to be a lot of inter-measure variance (more pronounced below for within subject variance). I’m not sure if there is something in common for the measures that show increasing between subjects variability with sample size and that separates them from those that show decreasing between subjects variability with sample size (the slight majority).
rel_df_sample_size_summary %>%
na.exclude() %>%
ggplot(aes(sample_size, mean_var_subs_pct))+
geom_line(aes(group = dv, color=ddm_raw), alpha = 0.1)+
geom_line(data = tmp, aes(sample_size,mean_var_subs_pct, color=ddm_raw))+
geom_point(data = tmp, aes(sample_size,mean_var_subs_pct, color=ddm_raw))+
facet_wrap(~overall_difference)+
ylab("Mean percentage of \n between subjects variance \n of 100 samples of size n")+
xlab("Sample size")+
theme(legend.title = element_blank(),
legend.position = "bottom")
summary(lmer(var_subs_pct ~ sample_size * overall_difference + (1|dv) + (1|iteration), rel_df_sample_size))
Linear mixed model fit by REML ['lmerMod']
Formula: var_subs_pct ~ sample_size * overall_difference + (1 | dv) +
(1 | iteration)
Data: rel_df_sample_size
REML criterion at convergence: 3274176
Scaled residuals:
Min 1Q Median 3Q Max
-4.957 -0.705 0.054 0.704 4.556
Random effects:
Groups Name Variance Std.Dev.
dv (Intercept) 109.05 10.44
iteration (Intercept) 1.26 1.12
Residual 212.51 14.58
Number of obs: 399034, groups: dv, 499; iteration, 100
Fixed effects:
Estimate Std. Error t value
(Intercept) 58.88947 0.89254 66.0
sample_size -0.08360 0.00108 -77.5
overall_differencecontrast -14.86635 1.33152 -11.2
overall_differencecondition -5.22575 1.10755 -4.7
sample_size:overall_differencecontrast 0.07402 0.00162 45.6
sample_size:overall_differencecondition 0.03666 0.00135 27.2
Correlation of Fixed Effects:
(Intr) smpl_s ovrll_dffrnccnt ovrll_dffrnccnd
sample_size -0.064
ovrll_dffrnccnt -0.660 0.043
ovrll_dffrnccnd -0.793 0.051 0.532
smpl_sz:vrll_dffrnccnt 0.042 -0.665 -0.064 -0.034
smpl_sz:vrll_dffrnccnd 0.051 -0.800 -0.034 -0.064
smpl_sz:vrll_dffrnccnt
sample_size
ovrll_dffrnccnt
ovrll_dffrnccnd
smpl_sz:vrll_dffrnccnt
smpl_sz:vrll_dffrnccnd 0.532
Does within subjects variance change with sample size?
Yes. This again goes against my intuition but here the inter-meausre differences are even more pronounced. There appears to be some measures for which the change in two measurements at different time points is larger the more subjects are tested and those that show a smaller decrease in within subject variance with larger sample sizes. I still don’t know if these two types of measures have anything that distinguishes them.
rel_df_sample_size_summary %>%
na.exclude() %>%
ggplot(aes(sample_size, mean_var_ind_pct))+
geom_line(aes(group = dv, color=ddm_raw), alpha = 0.1)+
geom_line(data = tmp, aes(sample_size,mean_var_ind_pct, color=ddm_raw))+
geom_point(data = tmp, aes(sample_size,mean_var_ind_pct, color=ddm_raw))+
facet_wrap(~overall_difference)+
ylab("Mean percentage of \n within subjects variance \n of 100 samples of size n")+
xlab("Sample size")+
theme(legend.title = element_blank(),
legend.position = "bottom")
summary(lmer(var_ind_pct ~ sample_size * overall_difference + (1|dv) + (1|iteration), rel_df_sample_size))
Linear mixed model fit by REML ['lmerMod']
Formula: var_ind_pct ~ sample_size * overall_difference + (1 | dv) + (1 |
iteration)
Data: rel_df_sample_size
REML criterion at convergence: 3471611
Scaled residuals:
Min 1Q Median 3Q Max
-4.539 -0.749 -0.171 0.698 4.331
Random effects:
Groups Name Variance Std.Dev.
dv (Intercept) 132.28 11.50
iteration (Intercept) 1.41 1.19
Residual 348.71 18.67
Number of obs: 399034, groups: dv, 499; iteration, 100
Fixed effects:
Estimate Std. Error t value
(Intercept) 18.28380 0.98355 18.6
sample_size 0.11368 0.00138 82.2
overall_differencecontrast 5.17440 1.46817 3.5
overall_differencecondition 2.01577 1.22121 1.7
sample_size:overall_differencecontrast -0.09479 0.00208 -45.6
sample_size:overall_differencecondition -0.03786 0.00173 -21.9
Correlation of Fixed Effects:
(Intr) smpl_s ovrll_dffrnccnt ovrll_dffrnccnd
sample_size -0.074
ovrll_dffrnccnt -0.660 0.050
ovrll_dffrnccnd -0.794 0.060 0.532
smpl_sz:vrll_dffrnccnt 0.049 -0.665 -0.074 -0.040
smpl_sz:vrll_dffrnccnd 0.059 -0.800 -0.040 -0.074
smpl_sz:vrll_dffrnccnt
sample_size
ovrll_dffrnccnt
ovrll_dffrnccnd
smpl_sz:vrll_dffrnccnt
smpl_sz:vrll_dffrnccnd 0.532
Does residual variance change with sample size?
rel_df_sample_size_summary %>%
na.exclude() %>%
ggplot(aes(sample_size, mean_var_resid_pct))+
geom_line(aes(group = dv, color=ddm_raw), alpha = 0.1)+
geom_line(data = tmp, aes(sample_size,mean_var_resid_pct, color=ddm_raw))+
geom_point(data = tmp, aes(sample_size,mean_var_resid_pct, color=ddm_raw))+
facet_wrap(~overall_difference)+
ylab("Mean percentage of residual variance \n of 100 samples of size n")+
xlab("Sample size")+
theme(legend.title = element_blank(),
legend.position = "bottom")
summary(lmer(var_resid_pct ~ sample_size * overall_difference + (1|dv) + (1|iteration), rel_df_sample_size))
Linear mixed model fit by REML ['lmerMod']
Formula: var_resid_pct ~ sample_size * overall_difference + (1 | dv) +
(1 | iteration)
Data: rel_df_sample_size
REML criterion at convergence: 2942570
Scaled residuals:
Min 1Q Median 3Q Max
-4.562 -0.624 -0.040 0.559 8.279
Random effects:
Groups Name Variance Std.Dev.
dv (Intercept) 39.93 6.319
iteration (Intercept) 0.44 0.663
Residual 92.59 9.623
Number of obs: 399034, groups: dv, 499; iteration, 100
Fixed effects:
Estimate Std. Error t value
(Intercept) 22.826752 0.540234 42.3
sample_size -0.030083 0.000712 -42.2
overall_differencecontrast 9.691933 0.806229 12.0
overall_differencecondition 3.209963 0.670616 4.8
sample_size:overall_differencecontrast 0.020769 0.001071 19.4
sample_size:overall_differencecondition 0.001197 0.000891 1.3
Correlation of Fixed Effects:
(Intr) smpl_s ovrll_dffrnccnt ovrll_dffrnccnd
sample_size -0.069
ovrll_dffrnccnt -0.660 0.046
ovrll_dffrnccnd -0.793 0.056 0.532
smpl_sz:vrll_dffrnccnt 0.046 -0.665 -0.070 -0.037
smpl_sz:vrll_dffrnccnd 0.055 -0.800 -0.037 -0.070
smpl_sz:vrll_dffrnccnt
sample_size
ovrll_dffrnccnt
ovrll_dffrnccnd
smpl_sz:vrll_dffrnccnt
smpl_sz:vrll_dffrnccnd 0.532
Conclusion: Larger samples are better for reliability but not necessarily always for the same reasons; for some variables this is due to increasing between subjects variance while for others it’s due to decreasing residual variance (?).
rm(rel_df_sample_size, rel_df_sample_size_summary)
The H in HDDM estimates stands for ‘hierarchical’ denoting the fact that the distribution of the whole sample is incorportated in the priors for the model parameters. This is different than e.g. EZ-diffusion parameters that would be the same for a given subject’s data regardless of the rest of the sample. One can ask whether this H indeed has a measurable impact.
HDDM and EZ parameter estimates might differ due to other difference in the estimation process as well. Therefore to evaluate whether the ‘H’ leads to meaningful changes in parameters we compare the same models fit on the same data using either the whole sample for the hierarchical structure or only the single subject’s data. These latter estimates are referred to as ‘flat’ estimates.
retest_hddm_flat = read.csv(paste0(retest_data_path,'retest_hddm_flat.csv'))
retest_hddm_flat = retest_hddm_flat %>% rename(sub_id = subj_id)
test_hddm_flat = read.csv(paste0(retest_data_path,'/t1_data/t1_hddm_flat.csv'))
test_hddm_flat = test_hddm_flat %>% rename(sub_id = subj_id)
# numeric_cols = get_numeric_cols()
# Check if all the variables are there (no for now)
# sum(names(retest_hddm_flat) %in% numeric_cols) == length(names(retest_hddm_flat))
# names(retest_hddm_flat)[which(names(retest_hddm_flat) %in% numeric_cols == FALSE)]
# sum(names(test_hddm_flat) %in% numeric_cols) == length(names(test_hddm_flat))
# names(test_hddm_flat)[which(names(test_hddm_flat) %in% numeric_cols == FALSE)]
Plot percent change in raw parameters for retest and t1 for each of 3 parameters (using hierarchical as baseline: what percent does the flat parameter change compared to the hierarchical)
For both time points the thresholds and non-decision times are very similary regardless of whether they are estimated heirarchically or without the hierarchy (difference peaking at 0).
This is also true for the majority of the drift rates. There are, however, almost as many drift rates that change completely when estimated without the hierarchy.
#Should not be necessary later
common_cols = names(retest_hddm_flat)[names(retest_hddm_flat) %in% names(retest_data)]
common_cols=common_cols[common_cols %in% names(test_hddm_flat)]
retest_hddm_hier = retest_data %>% select(common_cols) %>% mutate(hddm="hierarchical")
test_hddm_hier = test_data %>% select(common_cols) %>% mutate(hddm="hierarchical")
retest_hddm_flat = retest_hddm_flat %>% select(common_cols) %>% mutate(hddm="flat")
test_hddm_flat = test_hddm_flat %>% select(common_cols) %>% mutate(hddm="flat")
retest_flat_difference = rbind(retest_hddm_hier, retest_hddm_flat)
retest_flat_difference = retest_flat_difference %>%
gather(dv, value, -sub_id, -hddm) %>%
spread(hddm, value) %>%
mutate(diff_pct = (hierarchical - flat)/hierarchical*100,
diff_pct = ifelse(diff_pct<(-100), -100, ifelse(diff_pct>100, 100, diff_pct)),
time = "retest",
par = ifelse(grepl("drift", dv), "drift", ifelse(grepl("thresh", dv), "thresh", ifelse(grepl("non_decision", dv), "non_decision", NA))))
test_flat_difference = rbind(test_hddm_hier, test_hddm_flat)
test_flat_difference = test_flat_difference %>%
gather(dv, value, -sub_id, -hddm) %>%
spread(hddm, value) %>%
mutate(diff_pct = (hierarchical - flat)/hierarchical*100,
diff_pct = ifelse(diff_pct<(-100), -100, ifelse(diff_pct>100, 100, diff_pct)),
time = "test",
par = ifelse(grepl("drift", dv), "drift", ifelse(grepl("thresh", dv), "thresh", ifelse(grepl("non_decision", dv), "non_decision", NA))))
flat_difference = rbind(test_flat_difference, retest_flat_difference)
flat_difference %>%
ggplot(aes(diff_pct))+
geom_histogram()+
facet_grid(factor(time, levels = c("test", "retest"),labels = c("test", "retest")) ~ par, scales="free")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 1022 rows containing non-finite values (stat_bin).
Is the average percentage change different than 0? No.
This model uses the distribution of percentage difference in parameter estimates for drift rates as the baseline. A more appropriate model would test whether all three distributions are different than 0. The conclusion should not change: The intercept suggests that the mean of the drift rate difference distribution is at 0 and this is not different than the distributions for either of the other parameters and either time point.
summary(lmer(diff_pct ~ time*par+(1|dv), flat_difference %>% mutate(pars = factor(par, levels = c("thresh", "drift", "non_decision")))))
Linear mixed model fit by REML ['lmerMod']
Formula: diff_pct ~ time * par + (1 | dv)
Data:
flat_difference %>% mutate(pars = factor(par, levels = c("thresh",
"drift", "non_decision")))
REML criterion at convergence: 238677
Scaled residuals:
Min 1Q Median 3Q Max
-3.244 -0.258 -0.016 0.213 3.853
Random effects:
Groups Name Variance Std.Dev.
dv (Intercept) 130 11.4
Residual 1443 38.0
Number of obs: 23578, groups: dv, 82
Fixed effects:
Estimate Std. Error t value
(Intercept) -1.19 1.64 -0.73
timetest 1.64 0.62 2.65
parnon_decision 1.65 3.57 0.46
parthresh -1.84 3.39 -0.54
timetest:parnon_decision -1.38 1.35 -1.02
timetest:parthresh -1.71 1.28 -1.33
Correlation of Fixed Effects:
(Intr) timtst prnn_d prthrs tmts:_
timetest -0.191
parnon_dcsn -0.460 0.088
parthresh -0.485 0.093 0.223
tmtst:prnn_ 0.088 -0.459 -0.191 -0.043
tmtst:prthr 0.092 -0.483 -0.042 -0.192 0.222
Do people change in the same way? Mostly.
Plotting the raw parameter estimate that is estimated hierarchically against the estimate that is estimated without the hierarchy. Red lines are the 45-degree line. The fact that many points cluster around this line and that there are systematic deviances from it for any parameter at either time point suggests that the hierarchical and flat estimates are mostly the same.
flat_difference %>%
ggplot(aes(hierarchical, flat))+
geom_point()+
geom_abline(aes(intercept=0, slope=1), color="red")+
facet_grid(factor(time, levels = c("test", "retest"),labels = c("test", "retest")) ~ par)
Warning: Removed 1022 rows containing missing values (geom_point).
Plotting the reliability estimates for flat parameters against the reliability of hierarchical estimates. Red lines are 45-degree lines. While there are some changes in reliability nothing appears very large (or consequential in pushing the reliability to any acceptable level depending on the estimation method) or systematic.
rel_df_flat = make_rel_df(t1_df = test_hddm_flat, t2_df = retest_hddm_flat, metrics = c('icc', 'pearson', 'var_breakdown'))
rel_df_flat = rel_df_flat %>%
left_join(rel_df[,c("dv", "icc", "rt_acc", "overall_difference")], by = "dv")
rel_df_flat %>%
ggplot(aes(icc.y, icc.x))+
geom_point()+
geom_abline(aes(slope=1, intercept = 0), color="red")+
facet_wrap(~rt_acc)+
xlab("Hierarchical reliability")+
ylab("Flat reliability")
with(rel_df_flat %>% filter(rt_acc == "drift rate"), t.test(icc.x, icc.y, paired=T))
Paired t-test
data: icc.x and icc.y
t = 0.53, df = 51, p-value = 0.6
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.02018 0.03454
sample estimates:
mean of the differences
0.007181
with(rel_df_flat %>% filter(rt_acc == "threshold"), t.test(icc.x, icc.y, paired=T))
Paired t-test
data: icc.x and icc.y
t = -2, df = 15, p-value = 0.06
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.088179 0.002222
sample estimates:
mean of the differences
-0.04298
with(rel_df_flat %>% filter(rt_acc == "non-decision"), t.test(icc.x, icc.y, paired=T))
Paired t-test
data: icc.x and icc.y
t = -0.19, df = 13, p-value = 0.9
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.04691 0.03948
sample estimates:
mean of the differences
-0.003715
rm(flat_difference, rel_df_flat, retest_flat_difference, retest_hddm_flat, retest_hddm_hier, test_flat_difference, test_hddm_flat, test_hddm_hier)
Do the fit statistics differ by whether the model was hierarchical or not? No.
t1_hierarchical_fitstats = t1_hierarchical_fitstats %>%
filter(subj_id %in% retest_hierarchical_fitstats$subj_id)
tmp = rbind(retest_hierarchical_fitstats, retest_flat_fitstats, t1_hierarchical_fitstats, t1_flat_fitstats) %>%
separate(sample, c("time", "proc"), sep="_") %>%
select(log_rsq_adj, time, proc, subj_id, task_name) %>%
spread(proc, log_rsq_adj)
tmp %>%
ggplot(aes(hierarchical, flat))+
geom_point()+
geom_abline(aes(slope=1, intercept=0), color="red")+
facet_wrap(~time)
Warning: Removed 271 rows containing missing values (geom_point).
with(tmp %>% filter(time == "t1"), cor.test(flat, hierarchical))
Pearson's product-moment correlation
data: flat and hierarchical
t = 59, df = 1900, p-value <2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.7855 0.8174
sample estimates:
cor
0.8021
with(tmp %>% filter(time == "retest"), cor.test(flat, hierarchical))
Pearson's product-moment correlation
data: flat and hierarchical
t = 62, df = 2000, p-value <2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.7942 0.8244
sample estimates:
cor
0.8098
with(tmp %>% filter(time == "t1"), t.test(flat, hierarchical, paired=TRUE))
Paired t-test
data: flat and hierarchical
t = 0.16, df = 1900, p-value = 0.9
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.0007680 0.0009064
sample estimates:
mean of the differences
0.00006916
with(tmp %>% filter(time == "retest"), t.test(flat, hierarchical, paired=TRUE))
Paired t-test
data: flat and hierarchical
t = 1.2, df = 2000, p-value = 0.2
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.0002551 0.0010165
sample estimates:
mean of the differences
0.0003807
– Do DDM parameters capture similar processes as the raw measures in a given task or do they capture processes that are more similar across tasks? (If the former they would be less useful than if the latter.)
This could be analyzed with factor analysis but there are more variables than observations so as a first pass we’ll explore correlations.
For this we correlate each DDM measure with:
1. raw measures within task
2. other ddm measures across tasks
(Absolute) correlations between raw and ddm measures within a task are higher than those between ddm measures across tasks.
Factor analysis
– Are these clusters more reliable than using either the raw or the DDM measures alone?
– Do raw or DDM measures (or factor scores) predict real world outcomes better?